Multi-objective Contextual Multi-armed Bandit Problem with a Dominant Objective

Abstract

In this paper, we propose a new multi-objective contextual multi-armed bandit problem with two objectives, where one of the objectives dominates the other objective. Unlike single-objective bandit problems in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem, the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives and the distribution of the reward depends on the context that is provided to the learner at the beginning of each round. We call this problem contextual multi-armed bandit with a dominant objective (CMAB-DO). In CMAB-DO, the goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its total reward in the dominant objective. In this case, the optimal arm given a context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. First, we show that the optimal arm lies in the Pareto front. Then, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB), and define two performance measures: the 2-dimensional (2D) regret and the Pareto regret. We show that both the 2D regret and the Pareto regret of MOC-MAB are sublinear in the number of rounds. We also compare the performance of the proposed algorithm with other state-of-the-art methods in synthetic and real-world datasets. The proposed model and the algorithm have a wide range of real-world applications that involve multiple and possibly conflicting objectives ranging from wireless communication to medical diagnosis and recommender systems.
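
The arm-selection rule described in the abstract is lexicographic: among the arms that maximize the expected reward in the dominant objective, the learner picks the one with the highest expected reward in the non-dominant objective. Below is a minimal Python sketch of that rule, assuming the expected rewards for a fixed context are already known; the function name `optimal_arm` and the tie tolerance `tol` are illustrative and not from the paper (MOC-MAB itself must estimate these expectations from observed reward vectors).

```python
import numpy as np

def optimal_arm(mu_dominant, mu_nondominant, tol=1e-9):
    """Illustrative lexicographic arm selection (not the paper's code).

    mu_dominant, mu_nondominant: per-arm expected rewards for a fixed
    context in the dominant and non-dominant objectives, respectively.
    `tol` breaks floating-point ties in the dominant objective.
    """
    mu_dominant = np.asarray(mu_dominant, dtype=float)
    mu_nondominant = np.asarray(mu_nondominant, dtype=float)
    # Arms that (approximately) maximize the dominant objective.
    best = mu_dominant.max()
    candidates = np.flatnonzero(mu_dominant >= best - tol)
    # Among those, pick the arm maximizing the non-dominant objective.
    return candidates[np.argmax(mu_nondominant[candidates])]

# Arms 1 and 2 tie on the dominant objective; arm 2 wins on the
# non-dominant objective, so it is the optimal arm for this context.
print(optimal_arm([0.4, 0.9, 0.9], [0.8, 0.1, 0.6]))  # -> 2
```

Because the selected arm maximizes the dominant objective and cannot be improved in the non-dominant objective without losing in the dominant one, it is not dominated by any other arm, consistent with the abstract's claim that the optimal arm lies in the Pareto front.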

Bibliographic Information

  • Authors

    Tekin, Cem; Turgay, Eralp

  • Year: 2017
  • Format: PDF
